Support dataframe data format in native XGBoost. #9828

Merged
merged 2 commits into from
Dec 12, 2023

Conversation

trivialfis
Member

@trivialfis trivialfis commented Nov 30, 2023

  • Implement a columnar adapter.
  • Refactor Python pandas handling code to avoid converting into a single numpy array.
  • Add support in R for transforming columns.
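
A minimal sketch of the idea behind the first two bullets, in plain Python rather than the actual C++ adapter: each column keeps its own storage type and values are read per column, instead of first copying everything into one dense array of a single dtype. The `ColumnarAdapter` class and its methods are hypothetical names for illustration only, not XGBoost's API.

```python
from array import array


class ColumnarAdapter:
    """Hypothetical adapter: holds a list of typed columns without merging them."""

    def __init__(self, columns):
        lengths = {len(col) for col in columns}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same number of rows")
        self.columns = list(columns)

    @property
    def shape(self):
        n_rows = len(self.columns[0]) if self.columns else 0
        return n_rows, len(self.columns)

    def get(self, row, col):
        # Conversion to float happens per value on read, not per table up front.
        return float(self.columns[col][row])


# Three columns with different storage types: double, C int, signed char.
columns = [array("d", [0.5, 1.5]), array("i", [1, 2]), array("b", [0, 1])]
adapter = ColumnarAdapter(columns)
print(adapter.shape, adapter.get(1, 1))
```

The point of the layout is that the integer and byte columns are never widened into a shared float matrix; only the value being read is converted.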

This is not yet as efficient as we would like, since we can't handle a separate missing value indicator for each column. I will leave that as a future to-do.

Categorical data should now work with R factors, aside from the open question of how we should advise people to use a consistent encoder.

Related: #9810.

@trivialfis
Member Author

@david-cortes Could you please help take a look into the R interface?

Contributor

@david-cortes david-cortes left a comment

Left a few minor comments. I guess I could add int64 support later.

@trivialfis
Member Author

For some reason, even when just creating a DMatrix (the simplest possible task), the np array created from the arrow extension is incredibly inefficient. I have checked that the resulting DMatrices have the same hash. In addition, they run through the exact same code path inside libxgboost.

@trivialfis
Member Author

Got it: it's caused by the meta info rather than the covariates.

Contributor

@david-cortes david-cortes left a comment

Left a few more comments. I'm curious in particular about how these types interplay with feature types.

}
}))
## as.data.frame somehow converts integer/logical into real.
data <- as.data.frame(sapply(data, function(x) {
Contributor

  1. Does the result here need to be a data.frame? Maybe you could just call lapply, as I think it would avoid one list copy. Note that, per the other comments I'm leaving here, if integer columns have any missing values, they might need to be coerced through e.g. as.numeric.
  2. The comment above says that it converts integers to real. Isn't the idea with the types above to handle them in their native type?

Member Author

I think as.data.frame does the coercing for me.

Comment above says that it converts integers to real. Isn't the idea with the types above to handle them in their native type?

Ideally, we would like to handle different types of columns independently without any coercion, and hence without any data copying. However, at the moment only cuDF input can be consumed this way, because of how missing values are handled. R uses sentinel values to indicate missing/NA, while XGBoost can't have more than one missing value indicator at the moment. As a result, given a DF containing a float column and an integer column with NAs, XGBoost cannot tell which value it should treat as missing: NaN or NA(int)?
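
To illustrate the ambiguity described above, here is a small sketch in plain Python. It assumes the integer NA is stored as the smallest 32-bit integer, which is how R represents `NA_integer_`; the helper name `int_column_to_float` is hypothetical. Converting the integer column to floating point lets a single indicator (NaN) cover both columns, at the cost of a copy.

```python
import math

# Assumption: R stores NA_integer_ as the smallest 32-bit integer (INT_MIN).
R_NA_INTEGER = -(2**31)


def int_column_to_float(values):
    """Copy an integer column to floats, mapping the NA sentinel to NaN."""
    return [math.nan if v == R_NA_INTEGER else float(v) for v in values]


float_col = [0.5, math.nan, 2.0]   # missing value is NaN
int_col = [1, R_NA_INTEGER, 3]     # missing value is the integer sentinel

# After conversion both columns agree on one missing indicator: NaN.
converted = int_column_to_float(int_col)
missing_rows = [i for i, v in enumerate(converted) if math.isnan(v)]
```

Without the conversion, a single `missing` setting could match either NaN or the sentinel, but not both.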

cuDF uses the Arrow IPC format as its memory layout and exposes it as part of the API. Missing values are represented by a bitmask, so we can handle all the columns without any transformation (except for categorical encoding).
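
For comparison, a sketch of how a validity bitmask keeps "missing" separate from the stored value. It assumes the Arrow convention that bit `i` (least-significant bit first within each byte) is 1 when row `i` is valid; `is_valid` is a hypothetical helper, not a cuDF or XGBoost function.

```python
def is_valid(bitmask: bytes, i: int) -> bool:
    """Return True if row i is marked valid in an LSB-first validity bitmask."""
    return (bitmask[i // 8] >> (i % 8)) & 1 == 1


# Five rows; the stored value at a missing position is arbitrary (0 here).
values = [10, 0, 30, 0, 50]
bitmask = bytes([0b00010101])  # rows 0, 2 and 4 are valid

present = [v if is_valid(bitmask, i) else None for i, v in enumerate(values)]
```

Because validity lives outside the value buffer, an integer column needs no sentinel and no conversion to float.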

} else if (is.integer(x)) {
"int"
} else if (is.logical(x)) {
"i"
Contributor

Question there: I see this is also being used in the pandas adapter.

In R, boolean (logical) types are represented as C int, where possible values are FALSE (zero), NA (INT_MIN, the same sentinel as NA_integer_), and TRUE (everything else), while Python's bool type has only True and False.

I see you mention in a comment later that these get converted to numeric type, but the C++ code still checks for integer/logical-typed columns.

What would happen with these missing values, encoded as the integer NA sentinel, if the columns are supplied in their original types?

Member Author

As suggested by the comments in the C++ code, that handling code is not used; it is more or less a reminder that we should try to avoid data transformation in R. I think the previous reply might help with the NA sentinel part.

Member Author

I can remove the code if it's hindering readability

Contributor

I can remove the code if it's hindering readability

I actually was thinking something along the lines that using sapply instead of lapply + unlist would avoid one list copy operation. Haven't checked this hypothesis though. I don't think the code is unreadable or hard to understand.

Member Author

Thank you for the suggestion. I removed the unlist as suggested in #9828 (comment).

* @brief Create a DMatrix from columnar data. (table)
*
* @param data See @ref XGBoosterPredictFromColumnar for details.
* @param config See @ref XGDMatrixCreateFromDense for details.
Contributor

Something I'm wondering here: if this config already conveys the information about whether a column has integer type, is it actually needed to make a distinction between q and int in feature_types?

Member Author

The columns don't convey the information accurately, since we need to do some transformations before passing them into XGBoost. For instance, if a column is an integer column with missing values, we have to use float with NaN as an approximation.

@trivialfis
Member Author

I'm curious in particular about how these types interplay with feature types.

Other than the c type (for categorical), the feature types have no practical effect on how the trees are built; they are only there for nicer plotting.
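
As a rough illustration of how such type labels could be derived, the sketch below maps a column to one of the strings used in `feature_types` (`i`, `int`, `float`, `c`), mirroring the labels in the R snippet under review. The function `infer_feature_type`, and its use of Python lists with `None` for missing values, are assumptions for illustration, not the package's implementation.

```python
def infer_feature_type(column):
    """Hypothetical mapping from column contents to a feature-type label."""
    present = [v for v in column if v is not None]
    if not present:
        return "float"  # nothing to inspect; fall back to a numeric label
    # bool is a subclass of int in Python, so it must be checked first.
    if all(isinstance(v, bool) for v in present):
        return "i"      # indicator
    if all(isinstance(v, int) for v in present):
        return "int"
    if all(isinstance(v, str) for v in present):
        return "c"      # categorical: the only label that changes tree building
    return "float"


labels = [
    infer_feature_type(col)
    for col in ([True, False, None], [1, 2, None], ["a", "b"], [1.5, 2])
]
```

Note that the label describes the original column, not its storage: the integer column with a missing entry would still be passed to the library as floats with NaN.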

@trivialfis
Member Author

cc @david-cortes Data frames for label and base margin are not yet supported. These are useful for multi-output/multi-label problems, but we can work on them later.

- Implement a columnar adapter.
- Refactor Python pandas handling code to avoid converting into numpy.
- Add support in R for transforming columns.
@trivialfis
Member Author

Added the cnames configuration back for the matrix.

@trivialfis
Member Author

@david-cortes Could you please help take another look?

@@ -58,19 +61,28 @@ xgb.DMatrix <- function(
   qid = NULL,
   label_lower_bound = NULL,
   label_upper_bound = NULL,
-  feature_weights = NULL
+  feature_weights = NULL,
+  enable_categorical = FALSE
Contributor

Question: do I understand it correctly that this parameter is only used to auto-detect categorical features from data frames, but would otherwise play no role if e.g. the user were to manually set this field in the DMatrix later through setinfo, for example?

If so, how about renaming it to 'autodetect_categorical' or something along those lines? (both in the R and Python interfaces) Would also be ideal to describe a bit more of it in the docs (e.g. that it's only for data frames).

Member Author

Question: do I understand it correctly that this parameter is only used to auto-detect categorical features from data frames, but would otherwise play no role if e.g. the user were to manually set this field in the DMatrix later through setinfo, for example?

Correct. It's more or less a guard to prevent surprises: XGBoost didn't accept categorical data before, so suddenly accepting it could silently cause issues.
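
The guard behaviour described here can be sketched as follows. This is a simplified, hypothetical check (`validate_feature_types` is not a real XGBoost function): categorical columns detected in a data frame are rejected unless the caller opts in.

```python
def validate_feature_types(feature_types, enable_categorical=False):
    """Reject auto-detected categorical columns unless explicitly enabled."""
    if "c" in feature_types and not enable_categorical:
        raise ValueError(
            "categorical columns found; pass enable_categorical=True to accept them"
        )
    return list(feature_types)


# Opting in succeeds; the default raises instead of silently changing behaviour.
accepted = validate_feature_types(["float", "c"], enable_categorical=True)
try:
    validate_feature_types(["float", "c"])
    rejected = False
except ValueError:
    rejected = True
```

Under this reading the flag only affects auto-detection; types set manually afterwards would not pass through the check.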

Member Author

I don't have a strong preference on the naming. We have an introductory document for categorical data in the tutorials; feel free to add additional explanation there.

@david-cortes
Contributor

LGTM. Left two small comments.

@trivialfis trivialfis merged commit faf0f2d into dmlc:master Dec 12, 2023
30 checks passed
@trivialfis trivialfis deleted the data-columnar branch December 12, 2023 01:56